Exploration of Stability of Countries from May to July 2017 Using GDELT Dataset


Prince Joseph Erneszer Javier
Reynaldo Tugade, Jr.

Executive Summary

This notebook explores events extracted from the GDELT Dataset, a 542 GB compressed open-source index of the world's news media. Records and meta-information are grouped into three tables: Events, Mentions, and the Global Knowledge Graph (GKG). The tables are stored in compressed CSV files separated by date and table type. Summarizing data at this scale with primitive Python data structures alone is impractical, so this notebook leverages Dask to enable parallel and out-of-core computation. Using parallel computing, we collated general information capturing the theoretical impact specific types of events have on the stability of a country, namely the events' sentiment scores (AvgTone) and Goldstein Scale values. Results show that over the three months of May to July 2017, some countries in South America, Northern Africa, the Middle East, and Central Asia experienced the most adverse events from a regional-stability perspective.

Introduction

The GDELT Dataset

The Global Database of Events, Language, and Tone, known as GDELT, is a CAMEO-coded dataset containing geo-located events with global coverage from 1979 to the present. Each record consists of two actors and the action performed by Actor1 upon Actor2. Additionally, as stated in the GDELT codebook, a wide array of variables breaks out the raw CAMEO actor codes into their respective fields to make the data easier to work with. Action codes are broken out into their hierarchy, with Goldstein ranking scores included, alongside a unique array of georeferencing fields that offer estimated landmark- and centroid-level geographic positioning of both actors and of the action's location. Lastly, a "Mentions" table records the network trajectory of each event's story "in flight" through the global media system [1]. In this notebook, we aim to answer which countries were least stable during the period of May to July 2017 based on the Goldstein scores and Tone of the events in those countries.

Goldstein Scale and Average Tone

The Goldstein Scale captures the intensity of conflict or cooperation depending on the type of event [2]. This value ranges from -10 to +10: the more negative the value, the higher the intensity of conflict, while a more positive value indicates more cooperation. Note that it depends only on the type of event (e.g., military attack), not on the specifics of the event. Average Tone is the sentiment of the event based on the words used in the documents. This value ranges from -100 for extremely negative events to +100 for extremely positive events.
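As a toy illustration (the event types and scores below are made up, not taken from GDELT), the two measures can be read side by side: a negative Goldstein value flags a conflict-type event regardless of how the documents covering it were worded.

```python
import pandas as pd

# Made-up events illustrating the two scales:
# GoldsteinScale lies in [-10, 10]; AvgTone lies in [-100, 100].
events = pd.DataFrame({
    "EventType":      ["military attack", "diplomatic visit", "protest"],
    "GoldsteinScale": [-10.0, 7.0, -6.5],
    "AvgTone":        [-9.2, 3.1, -4.8],
})

# More negative Goldstein -> more intense conflict; positive -> cooperation.
conflict = events[events["GoldsteinScale"] < 0]
print(conflict["EventType"].tolist())  # the two conflict-type events
```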

Dask for Big Data Processing

With large volumes of records from a multitude of news sources stored in a set of compressed files, the average personal computer would struggle to process this dataset quickly [3]. An analysis would take considerable time using only native Python data structures. A framework capable of using multiple cores simultaneously therefore becomes very important. Dask, a library that encodes parallel algorithms using familiar Python callables, extends Python's capacity to parallelize complex codebases. It can significantly shorten the time needed to explore large amounts of data by effectively managing disk usage and task scheduling. This notebook demonstrates using Dask to quickly extract summarized insights from a large dataset.
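As a minimal sketch of this idea (using toy in-memory parts rather than the actual GDELT files), Dask builds a lazy task graph and evaluates nothing until `.compute()` is called, at which point a scheduler can run independent parts in parallel:

```python
import pandas as pd
from dask import delayed

def make_part(i):
    # Stand-in for reading one compressed CSV shard from disk
    return pd.DataFrame({"AvgTone": [i - 2.0, i + 1.0]})

# Build the task graph; nothing is read or computed yet
parts = [delayed(make_part)(i) for i in range(4)]

# Still lazy: a delayed sum over the per-part sums
total = delayed(sum)([p["AvgTone"].sum() for p in parts])

# Only now does the scheduler evaluate the graph (in parallel where possible)
print(total.compute())
```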

Methodology

The paper explores the Goldstein and Average Tone values of global and Philippine events from May to July 2017. The methods used included Dask for loading and processing the data with distributed workers (external machines), Pandas for loading the resulting smaller datasets and saving them to CSV files, Matplotlib for static visualizations, and Plotly for interactive visualizations.

Data Collection and Description

Since the GDELT database is massive, we focused on specific attributes that we wished to explore. We are interested in the impact of events on the stability of a country or region. Of the three tables GDELT provides, we concentrated on the Events table, which contains the following information:

Table 1. Specific GDELT attributes chosen for analysis

Name Type Description
GlobalEventID Integer Globally unique identifier assigned to each event record that uniquely identifies it in the master dataset
Day Integer Date the event took place in YYYYMMDD format
MonthYear Integer Alternative formatting of the event date, in YYYYMM format
Year Integer Alternative formatting of the event date, in YYYY format
Actor1Code String The complete raw CAMEO code for Actor1 (includes geographic, class, ethnic, religious, and type classes). May be blank if the system was unable to identify an Actor1
Actor1Name String The actual name of the Actor1. In the case of a political leader or organization, this will be the leader’s formal name (GEORGE W BUSH, UNITED NATIONS), for a geographic match it will be either the country or capital/major city name (UNITED STATES / PARIS), and for ethnic, religious, and type matches it will reflect the root match class (KURD, CATHOLIC, POLICE OFFICER, etc)
Actor1CountryCode String The 3-character CAMEO code for the country affiliation of Actor1
GoldsteinScale Float Each CAMEO event code is assigned a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country
NumMentions Integer This is the total number of mentions of this event across all source documents during the 15 minute update in which it was first seen
AvgTone Numeric This is the average “tone” of all documents containing one or more mentions of this event during the 15 minute update in which it was first seen



These attributes are stored in compressed files sorted by date. Each file is zip-compressed and further grouped by table type (e.g., export, mentions, GKG).

Exploratory Data Analysis

We performed an exploratory data analysis of the Goldstein Scale and Average Tone of global events from May to July 2017. In summary, we visualized scatterplots of events worldwide and in the Philippines, charted the top ten countries with the most positive and most negative Goldstein scales and average tones, and compared these values against the global and Philippine averages. When ranking the countries by Goldstein scale and average tone, countries with a number of events below the 10th percentile were removed from the dataset, as the number of events was deemed too few to estimate the general stability of the region.

The main insights from this exploratory analysis are:

  1. Among the ten countries with the most negative Goldstein Scores, two are from Africa, six are from Central Asia and the Middle East, and the other two are Venezuela and Serbia and Montenegro.
  2. The country with the most negative Goldstein Score is the Central African Republic at -1.80.
  3. The global Goldstein average is slightly positive at around 0.49, while the average for the Philippines is slightly negative at -0.27.
  4. Among the ten countries with the most negative Average Tone, two are from Africa, three are from the Middle East, and three are from Europe.
  5. The country with the most negative Average Tone is Venezuela at -5.55.
  6. The global average tone is -2.05, while the average tone in the Philippines is -2.97.

Data Loading and Preprocessing

We first loaded the packages that we needed.

In [1]:
# importing packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import glob
import dask.dataframe as dd
import dask.bag as db
from dask.delayed import delayed
from dask.distributed import Client
from joblib import parallel_backend  # sklearn.externals.joblib is deprecated
from dask.diagnostics import ProgressBar

from plotly.offline import plot, iplot, init_notebook_mode
import plotly.graph_objs as go

init_notebook_mode()

We connected with the dask cluster.

In [2]:
# set to run Dask commands in this "cluster"
client = Client('10.233.29.219:8786')

We checked the contents of the folder gdeltv2. The folder contains mentions.CSV.zip, gkg.csv.zip, and export.CSV.zip. From reading the GDELT documentation, the features we need are in the export.CSV.zip files.

In [6]:
# check the first five contents of the folder
path = '/mnt/data/public/gdeltv2/*'
glob.glob(path)[:5]
Out[6]:
['/mnt/data/public/gdeltv2/20170101190000.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170131134500.gkg.csv.zip',
 '/mnt/data/public/gdeltv2/20170210224500.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170213203000.gkg.csv.zip',
 '/mnt/data/public/gdeltv2/20170215044500.export.CSV.zip']

We checked the contents of one export.CSV.zip file. We found the data mentioned in the GDELT documentation but there were no column headers.

In [141]:
# we see three kinds of files
# let's open the contents one by one

# we define sample sets
f2 = ['/mnt/data/public/gdeltv2/20170611004500.export.CSV.zip']

# we import the progress bar
# pbar = ProgressBar()
# pbar.register()

# we load export.CSV.zip into a delayed Pandas dataframe

dfs = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                            dtype='str', engine='python') for fn in f2]
df = dd.from_delayed(dfs)
print(f2)
display(df.head().T)
['/mnt/data/public/gdeltv2/20170611004500.export.CSV.zip']
0 1 2 3 4
0 663766551 663766552 663766553 663766554 663766555
1 20160611 20160611 20160611 20160611 20160611
2 201606 201606 201606 201606 201606
3 2016 2016 2016 2016 2016
4 2016.4411 2016.4411 2016.4411 2016.4411 2016.4411
5 REB USA USA USA USA
6 SUICIDE BOMBER UNITED STATES THE US UNITED STATES UNITED STATES
7 nan USA USA USA USA
8 nan nan nan nan nan
9 nan nan nan nan nan
10 nan nan nan nan nan
11 nan nan nan nan nan
12 REB nan nan nan nan
13 nan nan nan nan nan
14 nan nan nan nan nan
15 JOR nan nan RUS RUS
16 JORDANIAN nan nan RUSSIAN RUSSIAN
17 JOR nan nan RUS RUS
18 nan nan nan nan nan
19 nan nan nan nan nan
20 nan nan nan nan nan
21 nan nan nan nan nan
22 nan nan nan nan nan
23 nan nan nan nan nan
24 nan nan nan nan nan
25 0 0 0 1 1
26 1831 081 120 040 040
27 183 081 120 040 040
28 18 08 12 04 04
29 4 2 3 1 1
... ... ... ... ... ...
31 8 4 2 4 2
32 1 1 1 2 1
33 8 4 2 4 2
34 -4.30107526881721 -0.96818810511757 -2.01117318435755 1.98251672782214 1.76991150442478
35 1 3 1 3 3
36 Jordan Shandon, California, United States Sweden Washington, District of Columbia, United States Washington, District of Columbia, United States
37 JO US SW US US
38 JO USCA SW USDC USDC
39 nan CA079 nan DC001 DC001
40 31 35.6552 62 38.8951 38.8951
41 36 -120.375 15 -77.0364 -77.0364
42 JO 249342 SW 531871 531871
43 1 0 0 3 1
44 Jordan nan nan Washington, District of Columbia, United States Russia
45 JO nan nan US RS
46 JO nan nan USDC RS
47 nan nan nan DC001 nan
48 31 nan nan 38.8951 60
49 36 nan nan -77.0364 100
50 JO nan nan 531871 RS
51 1 3 1 3 1
52 Jordan Shandon, California, United States Sweden Washington, District of Columbia, United States Russia
53 JO US SW US RS
54 JO USCA SW USDC RS
55 nan CA079 nan DC001 nan
56 31 35.6552 62 38.8951 60
57 36 -120.375 15 -77.0364 100
58 JO 249342 SW 531871 RS
59 20170611004500 20170611004500 20170611004500 20170611004500 20170611004500
60 http://www.nbcnews.com/storyline/isis-uncovere... http://www.sanluisobispo.com/news/local/articl... http://www.theage.com.au/world/sweden-figured-... http://www.air1.com/news/2017/06/10/Jeff-Sessi... http://www.air1.com/news/2017/06/10/Jeff-Sessi...

61 rows × 5 columns

We encoded the column headers that we found from the GDELT documentation.

In [5]:
# We see that there are no columns in the dataset
# We found the columns in GDELT website

events_columns = ['GlobalEventID', 'Day', 'MonthYear', 'Year', 'FractionDate',
                  'Actor1Code', 'Actor1Name', 'Actor1CountryCode',
                  'Actor1KnownGroupCode', 'Actor1EthnicCode',
                  'Actor1Religion1Code', 'Actor1Religion2Code',
                  'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code',
                  'Actor2Code', 'Actor2Name', 'Actor2CountryCode',
                  'Actor2KnownGroupCode', 'Actor2EthnicCode',
                  'Actor2Religion1Code', 'Actor2Religion2Code',
                  'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code',
                  'IsRootEvent', 'EventCode', 'EventBaseCode',
                  'EventRootCode', 'QuadClass', 'GoldsteinScale',
                  'NumMentions', 'NumSources', 'NumArticles', 'AvgTone',
                  'Actor1Geo_Type', 'Actor1Geo_Fullname',
                  'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code',
                  'Actor1Geo_ADM2Code', 'Actor1Geo_Lat', 'Actor1Geo_Long',
                  'Actor1Geo_FeatureID', 'Actor2Geo_Type',
                  'Actor2Geo_Fullname', 'Actor2Geo_CountryCode',
                  'Actor2Geo_ADM1Code', 'Actor2Geo_ADM2Code',
                  'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID',
                  'ActionGeo_Type', 'ActionGeo_Fullname',
                  'ActionGeo_CountryCode', 'ActionGeo_ADM1Code',
                  'ActionGeo_ADM2Code', 'ActionGeo_Lat', 'ActionGeo_Long',
                  'ActionGeo_FeatureID', 'DATEADDED', 'SOURCEURL']

We defined a function that loads the contents of a set of file paths and returns a dask dataframe. The preprocessing steps performed by the function are:

  1. Dropping rows with null values
  2. Removing non-numerical characters from numerical values, e.g. 42#.5 -> 42.5
  3. Converting numerical data to float
  4. Selecting the features needed for analysis
In [6]:
# We are ready to load a larger dataset

# Import regex
import re

# we import the progress bar
pbar = ProgressBar()
pbar.register()


def load_events(filenames):
    '''
    Load events data from list of filenames
    Select necessary columns, drop null values
    Convert numerical values to float
    Return the cleaned dask dataframe
    '''
    f_events = filenames

    # we load export.CSV.zip into a delayed Pandas dataframe
    dfs_events = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                                       dtype='str', names=events_columns, engine='python') for fn in f_events]
    df_events = dd.from_delayed(dfs_events).set_index('GlobalEventID')

    # Drop null values
    df_events = df_events.dropna(
        subset=['GoldsteinScale', 'NumMentions', 'AvgTone', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor1Geo_CountryCode', 'Actor1Geo_Fullname'])
    print("> Null values dropped.")

    # Numerical datapoints to clean
    to_clean = ['Day', 'MonthYear', 'GoldsteinScale', 'NumMentions',
                'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Lat', 'Actor1Geo_Long']

    # Numerical datapoints to convert (lat and long return errors on conversion)
    to_conv = ['Day', 'MonthYear', 'GoldsteinScale', 'NumMentions',
               'NumSources', 'NumArticles', 'AvgTone']

    # Clean numerical datapoints by removing non-numerical data
    pattern = re.compile('#')
    for col in to_clean:
        df_events[col] = df_events[col].str.strip().str.replace(pattern, '')
    print("> Removed non-numerical values from numerical dataset.")

    # Convert to numerical data
    df_events[to_conv] = df_events[to_conv].astype(float)
    print("> Converted numerical data to float.")

    # Check average goldstein score, avgTone of each country/location
    # Columns to keep
    keep_cols = ['Day', 'MonthYear', 'GoldsteinScale', 'NumMentions',
                 'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Lat', 'Actor1Geo_Long',
                 'Actor1Geo_CountryCode', 'Actor1Geo_Fullname']

    # Extract the needed data
    df_ = df_events[keep_cols]
    print("> Selected needed data from Events data.")
    print(f"> Type of dataframe {type(df_)}")

    return df_

Since we were only concerned with dates from May to July 2017, the datasets contained in .export.CSV.zip files with filenames starting in 201705, 201706, or 201707 were loaded.

In [12]:
# load the dataset; the character class [567] matches May, June, and July
f_events = glob.glob('/mnt/data/public/gdeltv2/20170[567]*.export.CSV.zip')
df_ = load_events(f_events)
> Null values dropped.
> Removed non-numerical values from numerical dataset.
> Converted numerical data to float.
> Selected needed data from Events data.
> Type of dataframe <class 'dask.dataframe.core.DataFrame'>

As another filter and to ensure that only data with dates from May to July 2017 were selected, only rows with MonthYear values of 201705, 201706, or 201707 were selected.

In [13]:
# Keep only rows with MonthYear in May-July 2017
df_ = df_[df_["MonthYear"].isin([201705, 201706, 201707])]

New columns containing the number of mentions per event multiplied by the Goldstein Scale or AvgTone were created. These will be used for calculating the mentions-weighted average Goldstein Scale and AvgTone per country, given by:

\begin{equation} \mathrm{AvgScore} = \frac{\sum(\mathrm{NumMentions} \times \mathrm{Score})}{\sum \mathrm{NumMentions}} \end{equation}
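On a toy table with made-up numbers, this mentions-weighted average is equivalent to NumPy's `np.average` with `weights=`, which can serve as a sanity check for the column-arithmetic approach used below:

```python
import numpy as np
import pandas as pd

# Toy per-event data (made-up): two events in one country
df = pd.DataFrame({"NumMentions": [10.0, 30.0],
                   "GoldsteinScale": [-2.0, 4.0]})

# AvgScore = sum(NumMentions * Score) / sum(NumMentions)
weighted = ((df["NumMentions"] * df["GoldsteinScale"]).sum()
            / df["NumMentions"].sum())

# Same result via NumPy's built-in weighted mean
assert weighted == np.average(df["GoldsteinScale"], weights=df["NumMentions"])
print(weighted)  # (10*-2 + 30*4) / 40 = 2.5
```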

In [14]:
# Average Goldstein and Avg Tone

# Average Goldstein score per country weighted by number of mentions (importance)
# New column added containing the GoldsteinScale and Avg Tone * num Mentions
df_['goldstein * num_mentions'] = df_['GoldsteinScale']*df_['NumMentions']
df_['avgtone * num_mentions'] = df_['AvgTone']*df_['NumMentions']
print("> New weighted Goldstein Scale column created.")
print("> New weighted avgtone column created.")
> New weighted Goldstein Scale column created.
> New weighted avgtone column created.

Average Goldstein and Average Tone per Country

All selected data were then grouped by location and month, and the values in each feature were summed within each group.

In [15]:
# Group by country and compute
df_by_country_month = df_[['Actor1Geo_CountryCode', 'MonthYear', 'NumMentions', 'goldstein * num_mentions',
                           'avgtone * num_mentions']].groupby(by=['Actor1Geo_CountryCode', 'MonthYear']).sum().compute()
print("> Data grouped by Actor1Geo_CountryCode and computed (sum) successfully.")
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://10.233.121.186:52896 remote=tcp://10.233.29.219:8786>
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://10.233.121.186:52928 remote=tcp://10.233.29.219:8786>
> Data grouped by Actor1Geo_CountryCode and computed (sum) successfully.

The table below shows the first five results of the grouped values.

In [16]:
# Head of summed values by country by month
df_by_country_month.head()
Out[16]:
NumMentions goldstein * num_mentions avgtone * num_mentions
Actor1Geo_CountryCode MonthYear
AE 201705.0 73916.0 107106.0 23181.924722
AF 201705.0 237120.0 -331068.8 -978934.886098
AL 201705.0 23071.0 12505.1 -57442.170394
AR 201705.0 27743.0 29108.4 -40456.270233
AS 201705.0 793446.0 659511.9 -870717.060426

The Average Tone and Average Goldstein Scale per country were then calculated following the equation above. The resulting dataset was saved in a csv file for quick reference.

In [17]:
# Get average goldstein and average tone per country per month
# Note: Average Goldstein and Avg Tone = sum(num mentions * score) / sum(num_mentions per country)
df_by_country_month['avg_goldstein'] = df_by_country_month['goldstein * num_mentions'] / \
    df_by_country_month['NumMentions']
df_by_country_month['avg_avgtone'] = df_by_country_month['avgtone * num_mentions'] / \
    df_by_country_month['NumMentions']
df_by_country_month.to_csv("data/df_by_country_month.csv", index=True)
print("> Successfully computed average tone and goldstein per country per month")
print("> Successfully saved dataset to csv")
> Successfully computed average tone and goldstein per country per month
> Successfully saved dataset to csv

We loaded the previously saved dataset into a pandas dataframe.

In [18]:
df_by_country_month = pd.read_csv("data/df_by_country_month.csv")
print("Loaded df_by_country_month csv file.")
Loaded df_by_country_month csv file.

Below are the first five rows in the loaded dataset.

In [19]:
df_by_country_month.head()
Out[19]:
Actor1Geo_CountryCode MonthYear NumMentions goldstein * num_mentions avgtone * num_mentions avg_goldstein avg_avgtone
0 AE 201705.0 73916.0 107106.0 23181.924722 1.449023 0.313625
1 AF 201705.0 237120.0 -331068.8 -978934.886098 -1.396208 -4.128437
2 AL 201705.0 23071.0 12505.1 -57442.170394 0.542027 -2.489800
3 AR 201705.0 27743.0 29108.4 -40456.270233 1.049216 -1.458251
4 AS 201705.0 793446.0 659511.9 -870717.060426 0.831199 -1.097387

Countries with Most Negative and Positive Goldstein and AvgTone Scores

We calculated the Average Tone and Average Goldstein Values per location over the whole 3-month period.

In [84]:
# Get the average values per country for the whole scope of date
df_by_country = df_by_country_month.groupby("Actor1Geo_CountryCode").sum()
df_by_country['avg_goldstein'] = df_by_country['goldstein * num_mentions'] / \
    df_by_country['NumMentions']
df_by_country['avg_avgtone'] = df_by_country['avgtone * num_mentions'] / \
    df_by_country['NumMentions']
print("> Calculated avg_goldstein and avg_avgtone per country for the whole duration considered")
> Calculated avg_goldstein and avg_avgtone per country for the whole duration considered

Since the locations are in FIPS format, we loaded a dictionary mapping each FIPS code to a location name. The dictionary doesn't contain DA, WI, and YI, so these three were added.

In [86]:
# load country codes dictionary
country_codes = dict(pd.read_csv('fips.csv', index_col='Code').T)

# These are locations not in the dictionary
country_codes["DA"] = ["Denmark"]
country_codes["WI"] = ["Wisconsin"]
country_codes["YI"] = ["Serbia and Montenegro"]

The locations whose events' total number of mentions fell in the bottom 10th percentile were removed, since these events were deemed too few to give a general sense of the region's stability.

In [87]:
# Only get the number of mentions above the 10th percentile
# This will filter out the least important 10% of events
lowest10 = df_by_country.NumMentions.quantile(0.1)
df_by_country = df_by_country[df_by_country.NumMentions >= lowest10]
print(
    f"> Removed NumMentions less than {lowest10}: least important for plotting")
> Removed NumMentions less than 769.0: least important for plotting

A function that plots a bar chart of the ten countries with the highest and lowest Goldstein and AvgTone values is defined below. Note that it uses the global variable world (the global average), which must be computed before the function is called.

In [88]:
def plot_barchart(df, value):
    "Plot bar chart of a value (string) per location"

    # Sort the countries by weighted goldstein scale
    _ = len(df)
    print(f"> {_} locations in the dataset")
    to_sort = value
    df_sorted = df.sort_values(
        by=to_sort, ascending=False).reset_index(drop=False)
    print(f"> Sorted according to {to_sort} and reset index")

    # top 10 and bottom 10 countries
    top = df_sorted.iloc[:10, :]
    bottom = df_sorted.iloc[-10:, :]

    # Philippine value
    ph = df_sorted[df_sorted.Actor1Geo_CountryCode == 'RP']
    print("> loaded Philippines value")

    country_names_top = [country_codes[i][0]
                         for i in top.Actor1Geo_CountryCode]
    country_names_bottom = [country_codes[i][0]
                            for i in bottom.Actor1Geo_CountryCode]

    # plot top countries
    plt.barh('Global Average', world)
    plt.barh('Philippines', ph[to_sort])
    plt.barh(country_names_top, top[to_sort])
    plt.barh(country_names_bottom, bottom[to_sort])
    plt.yticks(rotation=0)
    plt.xlabel(f'{value} Scale')
    plt.ylabel("Country Code")
    plt.title(f'Countries with Lowest and Highest {value}')
    plt.tight_layout()
    plt.savefig(f'charts/bar_{value}.png', dpi=150)

The global Goldstein Average was calculated to be 0.49.

In [160]:
# Get the global mean of goldstein score
world = np.sum(df_by_country_month["goldstein * num_mentions"]
               ) / np.sum(df_by_country_month["NumMentions"])
print(f"> loaded global value: {world}")
> loaded global value: 0.49106665355332063

The chart below shows the countries with the most positive and most negative Goldstein scores.

Among the ten countries with the most negative Goldstein Scores, two are from Africa, six are from the Middle East and Central Asia, and the other two are Venezuela and Serbia and Montenegro. The country with the most negative Goldstein Score is the Central African Republic at -1.80. The global Goldstein average is slightly positive at around 0.49, while the average for the Philippines is slightly negative at -0.27.

In [89]:
# Plot avg goldstein for the whole 3 month period May June Jul 2017
plot_barchart(df_by_country, 'avg_goldstein')
> 230 locations in the dataset
> Sorted according to avg_goldstein and reset index
> loaded Philippines value

The global Average Tone value is -2.05.

In [161]:
# Get the global mean of avgtone
world = np.sum(df_by_country_month["avgtone * num_mentions"]
               ) / np.sum(df_by_country_month["NumMentions"])
print(f"> loaded global value: {world}")
> loaded global value: -2.052939636155747

The chart below shows the countries with the most positive and most negative Average Tone values.

Among the ten countries with the most negative Average Tone, two are from Africa, three are from the Middle East, and three are from Europe. The country with the most negative Average Tone is Venezuela at -5.55. The global average tone is -2.05 while the average tone in the Philippines is -2.97.

In [162]:
# Plot avg tone for the whole 3 month period May June Jul 2017
plot_barchart(df_by_country, 'avg_avgtone')
> 230 locations in the dataset
> Sorted according to avg_avgtone and reset index
> loaded Philippines value

Goldstein and Average Tone per Coordinate

In this section, we plotted a scatterplot of Goldstein and AvgTone values of 1% of all the events from May to July 2017.

We first obtained a sample of 1% of the dataset which would be plotted in a scatterplot.

In [25]:
# Get a sample for plotting
frac = 0.01
df_events_sample = df_.sample(frac=frac).persist()
print(f"> Obtained a {frac} sample for plotting.")
print("> Selected data persisted into workers.")
> Obtained a 0.01 sample for plotting.
> Selected data persisted into workers.

The sampled dataset was saved in a csv for quick reference.

In [26]:
# Save sample to csv
df_events_sample.compute().to_csv("data/df_events_sample_coord.csv")
print("> df_events_sample_coord saved to csv")
> df_events_sample_coord saved to csv

The dataset saved above was then loaded in a dataframe.

In [27]:
# Load df_events_sample.csv
df_events_sample = pd.read_csv("data/df_events_sample_coord.csv")
df_events_sample.head()
Out[27]:
GlobalEventID Day MonthYear GoldsteinScale NumMentions NumSources NumArticles AvgTone Actor1Geo_Lat Actor1Geo_Long Actor1Geo_CountryCode Actor1Geo_Fullname goldstein * num_mentions avgtone * num_mentions
0 651554801 20170501.0 201705.0 3.4 2.0 1.0 2.0 3.720930 6.41667 2.88333 NI Badagry, Lagos, Nigeria 6.8 7.441860
1 651555864 20170501.0 201705.0 0.0 2.0 1.0 2.0 -1.620746 39.01940 125.75500 KN Pyongyang, P'yongyang-si, North Korea 0.0 -3.241491
2 651554398 20170501.0 201705.0 -2.0 4.0 1.0 4.0 -8.097166 NaN NaN NaN NaN -8.0 -32.388664
3 651555218 20170501.0 201705.0 1.0 2.0 1.0 2.0 -5.263158 35.84690 38.54430 SY Tabqa, Ar Raqqah, Syria 2.0 -10.526316
4 651554564 20170501.0 201705.0 -2.0 10.0 1.0 10.0 1.926164 13.00000 105.00000 CB Cambodia -20.0 19.261637

The chart below shows the distribution of Avg Tone values in the sampled dataset. The global average value in the sample was found to be -2.01.

In [217]:
plt.figure(figsize=(5,4))
plt.hist(df_events_sample.AvgTone, bins=30);
plt.title("Histogram of AvgTone in the Sampled Dataset")
plt.ylabel("Counts of Events")
plt.xlabel("AvgTone Value")
plt.tight_layout()
plt.savefig("charts/histogram_avgtone.png", dpi=150)
print(f"> Mean of AvgTone in the sampled dataset: {np.mean(df_events_sample.AvgTone)}")
> Mean of AvgTone in the sampled dataset: -2.0118533449488623

The chart below shows the distribution of Goldstein values in the sampled dataset. The global average value in the sample was found to be 0.56.

In [218]:
plt.figure(figsize=(5,4))
plt.hist(df_events_sample.GoldsteinScale, bins=30);
plt.title("Histogram of Goldstein Scale in the Sampled Dataset")
plt.ylabel("Counts of Events")
plt.xlabel("Goldstein Value")
plt.tight_layout()
plt.savefig("charts/histogram_goldstein.png", dpi=150)
print(f"> Mean of Goldstein in the sampled dataset: {np.mean(df_events_sample.GoldsteinScale)}")
> Mean of Goldstein in the sampled dataset: 0.5601541523789445

The longitudes and latitudes of each event and the corresponding Goldstein and Avg Tone values were extracted from the dataset.

In [177]:
# Plot the longitudes and latitudes color coded according to Goldstein value
y = df_events_sample['Actor1Geo_Lat']
x = df_events_sample['Actor1Geo_Long']
goldstein = df_events_sample['GoldsteinScale']
avgtone = df_events_sample['AvgTone']
num_mentions = df_events_sample['NumMentions']
print("> Coordinates and other valuable data to be visualized computed successfully.")
> Coordinates and other valuable data to be visualized computed successfully.

The events were plotted according to coordinates and color coded by Goldstein Score (red being most negative, green most positive, and yellow neutral). The size of each marker represents the number of mentions of the event (importance). Although negative values are present globally, prominent red patches are observable in parts of the US, South America, the Middle East, and Africa.

In [292]:
# Goldstein values vs Latitude and Longitude
# Color is goldstein while size is importance (num_mentions)
plt.style.use('default')
f, ax = plt.subplots(figsize=(11,5))
ax.scatter(x, y, c=goldstein, marker='o', s=num_mentions/3, cmap='RdYlGn', alpha=0.75)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.axis('equal')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Goldstein score per event from May to July 2017')
plt.tight_layout()

# Save figure
plt.savefig("charts/scatter_goldstein_world.png", dpi=150)

To see the locations of the 10% most positive and 10% most negative events, we filtered the events with Goldstein values below the 10th percentile and above the 90th percentile.

In [346]:
# Plot the longitudes and latitudes color coded according to a chosen value
# Filter values in the bottom and top `percentile` fraction


def plot_percentiles(name, df, value, percentile):
    """Given dataframe, value to plot (str) and percentile, we plot the top and bottom percentile"""

    df_red1 = df[df[value] <= df[value].quantile(percentile)]

    y_red1 = df_red1['Actor1Geo_Lat']
    x_red1 = df_red1['Actor1Geo_Long']
    goldstein_red1 = df_red1[value]
    num_mentions_red1 = df_red1['NumMentions']
    print(f"> {value} values in the bottom {percentile} calculated successfully.")

    # Filter values at or above the (1 - percentile) quantile

    df_green1 = df[df[value] >= df[value].quantile(1 - percentile)]

    y_green1 = df_green1['Actor1Geo_Lat']
    x_green1 = df_green1['Actor1Geo_Long']
    goldstein_green1 = df_green1[value]
    num_mentions_green1 = df_green1['NumMentions']
    print(
        f"> {value} values in the top {percentile} calculated successfully.")
    print(f"> Coordinates and other valuable data to be visualized computed successfully.")

    # Values vs Latitude and Longitude
    # Color is goldstein while size is importance (num_mentions)

    # Scatter the bottom percentile in red and the top percentile in green

    plt.style.use('default')
    f, ax = plt.subplots(figsize=(11, 5))
    ax.scatter(x_red1, y_red1, c='r', marker='o', s=num_mentions_red1 /
               3, alpha=1., label=f'bottom {percentile*100:.0f}%')
    ax.scatter(x_green1, y_green1, c='g', marker='o',
               s=num_mentions_green1/3, alpha=0.5, label=f'top {percentile*100:.0f}%')
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    plt.axis('equal')
    plt.xlabel('longitude')
    plt.ylabel('latitude')
    plt.title(
        f'{percentile*100:.0f}% Most Positive and Negative {value} scores per location from May to July 2017')
    plt.legend()
    plt.tight_layout()

    # Save figure
    plt.savefig(f"charts/scatter_{value}_{name}_top_bottom.png", dpi=150)

The chart below shows the 10% of events with the most negative and the 10% with the most positive Goldstein values. Notable is the distribution of events in the Middle East: some areas there have markedly more negative Goldstein values than positive ones.

In [347]:
plot_percentiles('world', df_events_sample, 'GoldsteinScale', 0.1)
> GoldsteinScale values in the bottom 0.1 calculated successfully.
> GoldsteinScale values in the top 0.1 calculated successfully.
> Coordinates and other valuable data to be visualized computed successfully.

Similarly, the events were plotted according to coordinates and color coded by AvgTone (red being most negative, green being most positive, and yellow neutral). The size of each marker represents the number of mentions of the event (importance). In general, the average tone appears evenly distributed globally.

In [315]:
# AvgTone values vs Latitude and Longitude
# Color is avg tone while size is importance (num_mentions)
f, ax = plt.subplots(figsize=(11,5))
ax.scatter(x, y, c=avgtone, marker='o', s=num_mentions/3, cmap='RdYlGn', alpha=0.75)
plt.axis('equal')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Average Tone per event from May to July 2017')
plt.tight_layout()

# Save figure
plt.savefig("charts/scatter_avgtone_world.png", dpi=150)

The chart below shows the 10% of events with the most negative and the 10% with the most positive AvgTone values. Notable is the distribution of events in the Middle East and Africa, where some areas have markedly more negative AvgTone values than positive ones.

In [331]:
plot_percentiles('world', df_events_sample, 'AvgTone', 0.1)
> AvgTone values in the bottom 0.1 calculated successfully.
> AvgTone values in the top 0.1 calculated successfully.
> Coordinates and other valuable data to be visualized computed successfully.

Choropleth Maps of Goldstein and Average Tone

Below, we visualized choropleth maps of the Goldstein scores and AvgTone globally.

In [2]:
df = pd.read_csv('data/df_by_country_month.csv')
df_data = df[['Actor1Geo_CountryCode', 'MonthYear','avg_goldstein', 'avg_avgtone']]
df_country_map = pd.read_csv('countrymap.txt',sep='\t')
df_country_map.columns = ['Country', 'Actor1Geo_CountryCode', '3let']
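The per-country-month summary loaded above (`data/df_by_country_month.csv`) is assumed to come from a groupby aggregation of the events table. A minimal sketch of how such a summary could be built, using a tiny synthetic pandas frame with the same column names (a Dask dataframe supports the same `groupby`/`agg` API, followed by `.compute()`):

```python
import pandas as pd

# Synthetic stand-in for the events data; column names mirror the notebook.
events = pd.DataFrame({
    'Actor1Geo_CountryCode': ['RP', 'RP', 'VE', 'VE'],
    'MonthYear': [201705, 201705, 201705, 201706],
    'GoldsteinScale': [-2.0, 4.0, -10.0, -5.0],
    'AvgTone': [0.5, -3.0, -6.0, -4.0],
})

# Unweighted means per country and month, named to match the CSV columns
df_by_country_month = (
    events.groupby(['Actor1Geo_CountryCode', 'MonthYear'])
          .agg(avg_goldstein=('GoldsteinScale', 'mean'),
               avg_avgtone=('AvgTone', 'mean'))
          .reset_index()
)
print(df_by_country_month)
```

This is illustrative only; the actual aggregation (and any weighting by mentions) used to produce the CSV may differ.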
In [3]:
def data_generator(time_delta_list, df, df_country_map, column_to_plot, colorbar_title):
    df_data_superset = df[['Actor1Geo_CountryCode',
                           'MonthYear', 'avg_goldstein', 'avg_avgtone']]
    df_country_map.columns = ['Country', 'Actor1Geo_CountryCode', '3let']
    data = []
    for every_timedelta in time_delta_list:
        df_data = df_data_superset.query('MonthYear=='+str(every_timedelta))
        df_to_plot = pd.merge(df_country_map, df_data, how='left', on=[
                              'Actor1Geo_CountryCode'])
        df_to_plot['avg_goldstein'] = df_to_plot['avg_goldstein'].fillna(0)
        df_to_plot['avg_avgtone'] = df_to_plot['avg_avgtone'].fillna(0)
        data.append(dict(
            visible=False,
            type='choropleth',
            locations=df_to_plot['3let'],
            z=df_to_plot[column_to_plot],
            text=df_to_plot['Country'],
            colorscale=[[0.0, 'rgb(165,0,38)'], [0.1111111111111111, 'rgb(215,48,39)'], [0.2222222222222222, 'rgb(244,109,67)'],
                        [0.3333333333333333, 'rgb(253,174,97)'], [0.4444444444444444, 'rgb(254,224,144)'], [
                0.5555555555555556, 'rgb(224,243,248)'],
                [0.6666666666666666, 'rgb(171,217,233)'], [0.7777777777777778, 'rgb(116,173,209)'], [
                0.8888888888888888, 'rgb(69,117,180)'],
                [1.0, 'rgb(49,54,149)']],
            autocolorscale=False,
            reversescale=False,
            marker=dict(
                line=dict(
                    color='rgb(180,180,180)',
                    width=0.5
                )),
            colorbar=dict(
                autotick=False,
                tickprefix='',
                title=colorbar_title),
        ))

    return data


def draw_choropleth(data_, title, df, filename):
    data_[0]['visible'] = True
    steps = []
    for i in range(len(data_)):
        step = dict(
            method='restyle',
            args=['visible', [False] * len(data_)],
            label=sorted(df['MonthYear'].unique())[i],
        )
        step['args'][1][i] = True
        steps.append(step)

    sliders = [dict(
        active=0,
        currentvalue={"prefix": "YearMonth: "},
        pad={"t": 10},
        steps=steps
    )]

    layout = dict(
        autosize=False,
        width=1000,
        height=600,
        title=title,
        dragmode='pan',
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection=dict(
                type='Mercator'
            )
        ),
        sliders=sliders
    )

    fig = dict(data=data_, layout=layout)
    iplot(fig, validate=False, filename=filename)

Notable regions with the most negative Goldstein values are South America, particularly Venezuela; Africa, particularly Somalia; and Central Asia, especially Afghanistan.

In [4]:
data_ = data_generator(df=df, df_country_map=df_country_map, time_delta_list=sorted(
    df['MonthYear'].unique()), column_to_plot='avg_goldstein', colorbar_title='Goldstein value')
draw_choropleth(data_, 'Monthly Global GoldStein Value',
                df, 'goldstein-world-map')

Notable regions with the most negative AvgTone values are South America, particularly Venezuela; Central Africa, particularly the Congo; and Northern Africa and the Middle East, including Libya, Egypt, and Iraq.

In [5]:
data_ = data_generator(df=df, df_country_map=df_country_map, time_delta_list=sorted(
    df['MonthYear'].unique()), column_to_plot='avg_avgtone', colorbar_title='AvgTone value')
draw_choropleth(data_, 'Monthly Global AvgTone Value', df, 'avgtone-world-map')

Goldstein and Average Tone per Coordinate in the Philippines

To examine the events in the Philippines, we first filtered the Philippine records out of the full May to July 2017 dataset.

In [351]:
# Filter Philippines dataframe
df_ph = df_[df_.Actor1Geo_CountryCode == 'RP']
print("> Philippines rows selected successfully")
> Philippines rows selected successfully

We saved the sampled dataset to a csv file.

In [352]:
# Save to csv
df_ph.compute().to_csv("data/df_ph_coord.csv")
print("> Successfully saved Ph sample to csv")
> Successfully saved Ph sample to csv

We loaded the csv file into a dataframe.

In [353]:
# Load
df_ph = pd.read_csv("data/df_ph_coord.csv")

Below are the first five rows in the loaded dataset.

In [354]:
df_ph.head()
Out[354]:
GlobalEventID Day MonthYear GoldsteinScale NumMentions NumSources NumArticles AvgTone Actor1Geo_Lat Actor1Geo_Long Actor1Geo_CountryCode Actor1Geo_Fullname goldstein * num_mentions avgtone * num_mentions
0 651554620 20170501.0 201705.0 0.0 8.0 1.0 8.0 0.281690 7.07306 125.613 RP Davao, Davao City, Philippines 0.0 2.253521
1 651554629 20170501.0 201705.0 -2.0 2.0 1.0 2.0 0.281690 7.07306 125.613 RP Davao, Davao City, Philippines -4.0 0.563380
2 651554633 20170501.0 201705.0 1.0 2.0 1.0 2.0 -3.225806 13.00000 122.000 RP Philippines 2.0 -6.451613
3 651555083 20170501.0 201705.0 3.5 8.0 1.0 8.0 0.281690 7.07306 125.613 RP Davao, Davao City, Philippines 28.0 2.253521
4 651555310 20170501.0 201705.0 4.0 10.0 1.0 10.0 -2.056075 13.00000 122.000 RP Philippines 40.0 -20.560748
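The two rightmost columns above, `goldstein * num_mentions` and `avgtone * num_mentions`, suggest a mentions-weighted average, which down-weights events the media barely covered. A minimal sketch with synthetic values mirroring the first three rows' Goldstein figures (the actual weighting used elsewhere in the notebook is an assumption here):

```python
import pandas as pd

# Synthetic frame with the notebook's column names
df_ph = pd.DataFrame({
    'GoldsteinScale': [0.0, -2.0, 1.0],
    'NumMentions': [8.0, 2.0, 2.0],
})

# Precompute the per-event product, as in the table above
df_ph['goldstein * num_mentions'] = (
    df_ph['GoldsteinScale'] * df_ph['NumMentions'])

# Mentions-weighted average: sum of products over total mentions
weighted_avg = (df_ph['goldstein * num_mentions'].sum()
                / df_ph['NumMentions'].sum())
print(weighted_avg)  # (0 - 4 + 2) / 12
```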

We selected the latitudes and longitudes to be plotted.

In [355]:
# Get Latitude and Longitude
y2 = df_ph['Actor1Geo_Lat']
x2 = df_ph['Actor1Geo_Long']
print("> Coordinates to be visualized computed successfully.")
> Coordinates to be visualized computed successfully.

We selected the Goldstein and AvgTone values to be plotted.

In [356]:
# Get goldstein and avgtone values
goldstein2 = df_ph['GoldsteinScale']
avgtone2 = df_ph['AvgTone']
num_mentions2 = df_ph['NumMentions']
print("> Other valuable data to be visualized computed successfully.")
> Other valuable data to be visualized computed successfully.

Below is the plot of all Goldstein values of events in the Philippines during the three-month period.

In [357]:
# Goldstein values vs Latitude and Longitude
# Color is goldstein while size is importance (num_mentions)
plt.style.use('default')
f, ax = plt.subplots(figsize=(6,6))
ax.scatter(x2, y2, c=goldstein2, marker='o', s=num_mentions2/2, cmap='Spectral', alpha=0.75)
plt.axis('equal')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Goldstein score per event in the Ph from May - July 2017')
plt.tight_layout()
plt.savefig('charts/scatter_goldstein_ph.png', dpi=150)

Below is the plot of all AvgTone values of events in the Philippines during the three-month period.

In [358]:
# AvgTone values vs Latitude and Longitude
# Color is avg tone while size is importance (num_mentions)
f, ax = plt.subplots(figsize=(6,6))
ax.scatter(x2, y2, c=avgtone2, marker='o', s=num_mentions2/2, cmap='Spectral', alpha=0.75)
plt.axis('equal')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Average Tone per event in the Ph from May - July 2017')
plt.tight_layout()
plt.savefig('charts/scatter_avgtone_ph.png', dpi=150)

From the chart below, we cannot visually identify a distinct concentration of negative or positive Goldstein values in any particular area.

In [359]:
plot_percentiles('ph', df_ph, 'GoldsteinScale', 0.1)
> GoldsteinScale values in the bottom 0.1 calculated successfully.
> GoldsteinScale values in the top 0.1 calculated successfully.
> Coordinates and other valuable data to be visualized computed successfully.

From the chart below, we can see a relatively high concentration of negative AvgTone compared with positive AvgTone in the Visayas and Mindanao regions.

In [360]:
plot_percentiles('ph', df_ph, 'AvgTone', 0.1)
> AvgTone values in the bottom 0.1 calculated successfully.
> AvgTone values in the top 0.1 calculated successfully.
> Coordinates and other valuable data to be visualized computed successfully.

Results

We have seen how quickly insights can be extracted from a large dataset using Dask. For the three-month period of May to July 2017, several areas experienced events with a consistently negative theoretical impact on their political stability. Countries such as Venezuela, parts of Africa, and parts of the Middle East and Central Asia (Afghanistan) had the most consistently negative Goldstein and AvgTone scores. During these months, the dominant event types were terrorist attacks against public infrastructure [4], protests against the government [5], and war and conflict [6].

Among the African countries, Niger stands out with the most positive Goldstein and AvgTone scores. These reflect generally positive events during the period, such as Niger's army rescuing 92 migrant workers abandoned in the Sahara [7]. Other positive incidents in the region, such as freed captives [8] and environmental awareness [9], show relatively higher scores, suggesting that such events contribute positively to the general stability of a region.

The Philippines had generally more negative Goldstein and AvgTone scores than the global average during those three months, with negative events most marked in Mindanao and the Visayas. Those months coincided with the height of the Marawi siege, when terrorists took over the city of Marawi in Mindanao. The siege prompted the proclamation of Martial Law in Mindanao and military intervention in the region, especially in Marawi; the city was declared liberated on October 17, 2017 [10]. The crisis also prompted the Visayas to heighten security [11].

Further research may include expanding the scope to one year or more, looking for trends in instability over time, and correlating regional instability with other data such as GDP per capita and exports.

References

[1] GDELT Event Codebook V2.0 [PDF]. (2015, September 2). Gdeltproject.org. Retrieved from http://gdeltproject.org/

[2] Goldstein Scale for WEIS Data. Retrieved from http://web.pdx.edu/~kinsella/jgscale.html

[3] Rocklin, M. (2015). Dask: Parallel Computation with Blocked Algorithms and Task Scheduling [PDF]. 14th Python in Science Conference, SciPy 2015.

[4] Michaelson, R. (2017, May 26). Egypt launches raids in Libya after attack on Coptic Christians kills 26. Retrieved from https://www.theguardian.com/world/2017/may/26/several-killed-in-attack-on-bus-carrying-coptic-christians-in-egypt

[5] Venezuela: 50th day of protests brings central Caracas to a standstill. (2017, May 20). Retrieved from https://www.theguardian.com/world/2017/may/20/venezuela-50th-day-of-protests-brings-central-caracas-to-a-standstill

[6] "The city of Bangassou has turned into a battlefield; we fear the worst for the civilian population". Retrieved from https://www.msf.org/central-african-republic-city-bangassou-has-turned-battlefield-we-fear-worst-civilian-population

[7] Telesur. (2017, June 14). Niger Army Rescue 92 Migrants Left for Dead in Sahara Desert. Retrieved from https://www.telesurenglish.net/news/Niger-Army-Rescue-92-Migrants-Left-for-Dead-in-Sahara-Desert-20170614-0011.html

[8] Busari, S., & Croft, J. (2017, May 08). 82 released Chibok schoolgirls arrive in capital. Retrieved from https://edition.cnn.com/2017/05/07/africa/chibok-girls-released/index.html

[9] Sebunya, K. (2017, July 31). Saving the world's wildlife is not just 'a white person thing'. Retrieved from https://www.theguardian.com/environment/africa-wild/2017/jul/31/saving-wildlife-conservation-africa-colonialism-race

[10] ABS-CBN. (2017, October 17). TIMELINE: The Battle for Marawi. Retrieved from https://news.abs-cbn.com/news/10/17/17/timeline-the-battle-for-marawi

[11] Mayol, A. et al. (2017, May 24). Alert up in Visayas amid Marawi crisis. Retrieved from https://newsinfo.inquirer.net/899312/alert-up-in-visayas-amid-marawi-crisis

Acknowledgements

We would like to acknowledge the Asian Institute of Management ACCESS Lab for the dataset, and Prof. Christian Alis for guidance.